Unsupervised detection of anomalous text

Author

  • David Guthrie
Abstract

This thesis describes work on the detection of anomalous material in text without the use of training data. We use the term anomalous to refer to text that is irregular or deviates significantly from its surrounding context. We show that identifying such abnormalities in text can be viewed as a type of outlier detection, because these anomalies differ significantly from the writing style of the majority of the data. We consider segments of text that are anomalous with respect to topic (i.e. about a different subject), author (written by a different person), or genre (written for a different audience or from a different source), and we test whether these anomalous segments can be identified automatically. Five different innovative approaches to this problem are introduced and assessed in many experiments over large document collections created to contain randomly inserted anomalous segments. To identify anomalies in text successfully, we investigate and evaluate 166 stylistic and linguistic features used to characterize writing, some of which are well-established stylistic determiners, but many of which are original. Using these features with each of our methods, we examine the effect of segment size on our ability to detect anomalies, with segments of 100, 500 and 1000 words. We show substantial improvements over a baseline in all cases for all methods, and we identify a novel method that performs consistently better than the others, along with the features that contribute most to unsupervised anomaly detection.
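The general idea of treating anomalous segments as outliers in a stylistic feature space can be illustrated with a minimal sketch. It is not one of the five methods from the thesis, and it does not use the 166 features, neither of which is specified in this abstract. The three features, the small `FUNCTION_WORDS` set, and the names `segment_words`, `features`, `anomaly_scores` and `most_anomalous` are all assumptions made for illustration. Each fixed-size segment is scored by its standardized Euclidean distance from the average segment, and the segment furthest away is flagged.

```python
import math
import re
from statistics import mean, pstdev

# Hypothetical, tiny function-word list; an illustrative assumption only.
FUNCTION_WORDS = {"the", "of", "and", "a", "to", "in", "is", "that", "it", "was"}


def segment_words(text, size):
    """Split text into consecutive segments of exactly `size` words."""
    words = re.findall(r"[A-Za-z']+", text)
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    return [c for c in chunks if len(c) == size]  # drop a short trailing chunk


def features(words):
    """Three simple stylistic features (illustrative, not the thesis's set)."""
    lowered = [w.lower() for w in words]
    return [
        mean(len(w) for w in words),                               # average word length
        len(set(lowered)) / len(lowered),                          # type-token ratio
        sum(w in FUNCTION_WORDS for w in lowered) / len(lowered),  # function-word rate
    ]


def anomaly_scores(segments):
    """Score each segment by its z-scored distance from the mean segment."""
    rows = [features(s) for s in segments]
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # Fall back to 1.0 when a feature is constant, to avoid division by zero.
    sigmas = [pstdev(c) or 1.0 for c in cols]
    return [
        math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(row, mus, sigmas)))
        for row in rows
    ]


def most_anomalous(text, size=100):
    """Return (index of the highest-scoring segment, list of all scores)."""
    segments = segment_words(text, size)
    scores = anomaly_scores(segments)
    return max(range(len(segments)), key=scores.__getitem__), scores
```

Calling `most_anomalous(document_text, size=500)` would rank 500-word segments in the same way. No training data is involved: the reference point is computed from the document itself, which only works when the anomalous material is a small minority of the segments.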

Related resources

Unsupervised Learning-based Anomalous Arabic Text Detection

The growing dependence of modern society on the Web as a vital source of information and communication has become inevitable. However, the Web has become an ideal channel for various terrorist organisations to publish their misleading information and send unintelligible messages to communicate with their clients as well. The increase in the number of published anomalous misleading information o...

A Probabilistic Approach to Aggregating Anomalies for Unsupervised Anomaly Detection with Industrial Applications

This paper presents a novel, unsupervised approach to detecting anomalies at the collective level. The method probabilistically aggregates the contribution of the individual anomalies in order to detect significantly anomalous groups of cases. The approach is unsupervised in that as only input, it uses a list of cases ranked according to its individual anomaly score. Thus, any anomaly detection...

An Unsupervised Cooperative Pattern Recognition Model to Identify Anomalous Massive SNMP Data Sending

In this paper, we review a visual approach and propose it for analysing computer-network activity, which is based on the use of unsupervised connectionist neural network models and does not rely on any previous knowledge of the data being analysed. The presented Intrusion Detection System (IDS) is used as a method to investigate the traffic which travels along the analysed network, detecting SN...

Recurrent Neural Network Language Models for Open Vocabulary Event-Level Cyber Anomaly Detection

Automated analysis methods are crucial aids for monitoring and defending a network to protect the sensitive or confidential data it hosts. This work introduces a flexible, powerful, and unsupervised approach to detecting anomalous behavior in computer and network logs; one that largely eliminates domain-dependent feature engineering employed by existing methods. By treating system logs as threa...

BotOnus: an online unsupervised method for Botnet detection

Botnets are recognized as one of the most dangerous threats to the Internet infrastructure. They are used for malicious activities such as launching distributed denial of service attacks, sending spam, and leaking personal information. Existing botnet detection methods produce a number of good ideas, but they are far from complete yet, since most of them cannot detect botnets in an early stage ...

Publication date: 2008